Abstract

A coding theorem is proved for a class of stationary channels with feedback in which the output $Y_n = f(X_{n-m}^n, Z_{n-m}^n)$ is a function of the current and past $m$ symbols from the channel input $X_n$ and the stationary ergodic channel noise $Z_n$. In particular, it is shown that the feedback capacity is equal to

$$\lim_{n\to\infty} \sup_{p(x^n \| y^{n-1})} \frac{1}{n} I(X^n \to Y^n),$$

where $I(X^n \to Y^n) = \sum_{i=1}^n I(X^i; Y_i \mid Y^{i-1})$ denotes the Massey directed information from the channel input to the output, and the supremum is taken over all causally conditioned distributions $p(x^n \| y^{n-1}) = \prod_{i=1}^n p(x_i \mid x^{i-1}, y^{i-1})$. The main ideas of the proof are the Shannon strategy for coding with side information and a new elementary coding technique for the given channel model without feedback, which is in a sense dual to Gallager's lossy coding of stationary ergodic sources. A similar approach gives a simple alternative proof of coding theorems for finite-state channels by Yang-Kavcic-Tatikonda, Chen-Berger, and Permuter-Weissman-Goldsmith.
1 Introduction



Shannon [34] showed that the capacity $C$ of a memoryless channel $(\mathcal{X}, p(y|x), \mathcal{Y})$, operationally defined as the supremum of all achievable rates [9, Section 7.5], is characterized by

$$C = \sup_{p(x)} I(X;Y). \tag{1}$$

When the channel has memory but still maintains certain ergodic properties, then (1) can be extended to the following multi-letter expression:

$$C = \lim_{n\to\infty} \sup_{p(x^n)} \frac{1}{n} I(X^n; Y^n). \tag{2}$$

For example, Dobrushin [10] showed that the capacity formula (2) holds if the channel is information stable; see also Pinsker [33]. Further extensions and refinements of (2) with more general capacity formulas abound in the literature. For stationary channels, readers are referred to Gray and Ornstein [17], Kieffer [20], and the references therein. A general formula for the capacity is given by Verdu and Han [38] for arbitrary nonstationary channels that can be represented through a sequence of $n$-dimensional conditional distributions (even without any consistency requirement); see also Han [18].

For memoryless channels with feedback, it was again Shannon [35] who showed that feedback does not increase the capacity and hence that the feedback capacity is given by

$$C_{\mathrm{FB}} = C = \sup_{p(x)} I(X;Y). \tag{3}$$

As in the case of nonfeedback capacity (2), the question arises how to extend the feedback capacity formula (3) to channels with memory. The most natural candidate is the following multi-letter expression with directed information introduced by Massey [26] in place of the usual mutual information in (2):

$$C_{\mathrm{FB}} = \lim_{n\to\infty} \sup_{p(x^n \| y^{n-1})} \frac{1}{n} I(X^n \to Y^n) = \lim_{n\to\infty} \sup_{p(x^n \| y^{n-1})} \frac{1}{n} \sum_{i=1}^n I(X^i; Y_i \mid Y^{i-1}), \tag{4}$$

where the supremum is taken over all $n$-dimensional causally conditioned probabilities

$$p(x^n \| y^{n-1}) = \prod_{i=1}^n p(x_i \mid x^{i-1}, y^{i-1}) = p(x_1)\, p(x_2 \mid x_1, y_1) \cdots p(x_n \mid x_1, \ldots, x_{n-1}, y_1, \ldots, y_{n-1}).$$

The main goal of this paper is to establish the validity of the feedback capacity formula (4) for a reasonably general class of channels with memory, in the simplest manner.

Massey [26] introduced the mathematical notion of directed information

$$I(X^n \to Y^n) = \sum_{i=1}^n I(X^i; Y_i \mid Y^{i-1}),$$

and established its operational meaning by showing that the feedback capacity is upper bounded by the maximum normalized directed information, which can be in general tighter than the usual mutual information. He also showed that (4) reduces to (3) if the channel is memoryless, and to (2) if the channel is used without feedback. Kramer [23, 24] streamlined the notion of directed information further and explored many interesting properties; see also Massey and Massey [27].
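
As a rough numerical illustration of the definition above (a sketch under our own assumptions, not part of the original development), the following Python fragment evaluates $I(X^n \to Y^n)$ and $I(X^n; Y^n)$ from a joint pmf stored as a dictionary over tuples $(x_1, \ldots, x_n, y_1, \ldots, y_n)$. The binary symmetric channel with crossover probability 0.1, the choice $n = 2$, and the toy feedback rule $X_2 = Y_1$ are all hypothetical choices made only for this example.

```python
import itertools
import math
from collections import defaultdict


def marginal(joint, keep):
    """Marginalize a pmf over tuples onto the coordinate positions in `keep`."""
    out = defaultdict(float)
    for outcome, prob in joint.items():
        out[tuple(outcome[k] for k in keep)] += prob
    return out


def cond_mutual_info(joint, a, b, c):
    """I(A; B | C) in bits; a, b, c are lists of coordinate positions."""
    p_abc, p_ac = marginal(joint, a + b + c), marginal(joint, a + c)
    p_bc, p_c = marginal(joint, b + c), marginal(joint, c)
    total = 0.0
    for outcome, prob in joint.items():
        if prob == 0.0:
            continue
        ka = tuple(outcome[k] for k in a)
        kb = tuple(outcome[k] for k in b)
        kc = tuple(outcome[k] for k in c)
        ratio = p_abc[ka + kb + kc] * p_c[kc] / (p_ac[ka + kc] * p_bc[kb + kc])
        total += prob * math.log2(ratio)
    return total


def directed_info(joint, n):
    """Sum over i of I(X^i; Y_i | Y^{i-1}); x's occupy positions 0..n-1, y's n..2n-1."""
    return sum(
        cond_mutual_info(joint, list(range(i)), [n + i - 1], list(range(n, n + i - 1)))
        for i in range(1, n + 1)
    )


def mutual_info(joint, n):
    return cond_mutual_info(joint, list(range(n)), list(range(n, 2 * n)), [])


def bsc(y, x, p):
    return 1 - p if y == x else p


def h2(q):
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)


p = 0.1
no_feedback, feedback = {}, {}
for x1, y1, y2 in itertools.product((0, 1), repeat=3):
    for x2 in (0, 1):  # i.i.d. uniform inputs, no feedback
        no_feedback[(x1, x2, y1, y2)] = 0.25 * bsc(y1, x1, p) * bsc(y2, x2, p)
    # toy feedback rule: the second input repeats the first output, X2 = Y1
    feedback[(x1, y1, y1, y2)] = 0.5 * bsc(y1, x1, p) * bsc(y2, y1, p)

di_no_fb, mi_no_fb = directed_info(no_feedback, 2), mutual_info(no_feedback, 2)
di_fb, mi_fb = directed_info(feedback, 2), mutual_info(feedback, 2)
```

In this toy case the two quantities coincide without feedback, while with the feedback rule the mutual information counts the fed-back output as if it were information conveyed by the input and therefore exceeds the directed information.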

For channels with certain structures, the validity of the feedback capacity formula (4) has been established implicitly. For example, Cover and Pombra [8] give a multi-letter characterization of the Gaussian feedback capacity, and Alajaji [1] characterizes the feedback capacity of discrete channels with additive noise; feedback does not increase the capacity of discrete additive channels when there is no input cost constraint. Both results can be recast in the form of directed information (see [8, Eq. (52)] and [1, Eq. (17)]). The notion of directed information in these contexts, however, has a very limited role as an intermediate step in the proof of converse coding theorems. Indeed, the highlight of the Cover-Pombra characterization is the asymptotic equipartition property of arbitrary nonstationary nonergodic Gaussian processes [8, Section V]; see also Pinsker [33]. (The case of discrete additive channels is trivial since the optimal input distribution is memoryless and uniform.)

In a heroic effort [37], Tatikonda attacked the general nonanticipatory channel with feedback by combining the Verdu-Han formula for nonfeedback capacity, Massey directed information, and the Shannon strategy for channel side information [36]. As the cost of generality, however, it is extremely difficult to establish a simple formula like (4). Furthermore, the coding theorem in [37] is not proved in a completely satisfactory manner.

More recently, Yang, Kavcic, and Tatikonda [40] and Chen and Berger [6] studied special cases of finite-state channels, based on Tatikonda's framework. A finite-state channel [14, Section 4.6] is described by a conditional probability distribution

$$p(y_n, s_n \mid x_n, s_{n-1}), \tag{5}$$

where $s_n$ denotes the channel state at time $n$. Using a different approach based on Gallager's proof of the nonfeedback capacity [14, Section 5.9], Permuter, Weissman, and Goldsmith [31] proved various coding theorems for finite-state channels with feedback that include inter alia the results of [40, 6] and establish the validity of (4) for indecomposable finite-state channels without intersymbol interference (i.e., the channel states evolve as an ergodic Markov chain, independent of the channel input).

As mentioned before, we strive to give a straightforward treatment of the feedback coding theorem. Towards this goal, this paper focuses on stationary nonanticipatory channels of the form

$$Y_n = g(X_{n-m}, X_{n-m+1}, \ldots, X_n, Z_{n-m}, Z_{n-m+1}, \ldots, Z_n). \tag{6}$$

In words, the channel output $Y_n$ at time $n$ is given as a time-invariant deterministic function of channel inputs $X_{n-m}^n = (X_{n-m}, X_{n-m+1}, \ldots, X_n)$ up to past $m$ symbols and channel noises $Z_{n-m}^n = (Z_{n-m}, Z_{n-m+1}, \ldots, Z_n)$ up to past $m$ symbols. We assume the noise process $\{Z_n\}_{n=1}^\infty$ is an arbitrary stationary ergodic process (without any mixing condition) independent of the message sent over the channel.
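
To make the channel model (6) concrete, here is a minimal simulation sketch. The function name `sliding_block_channel`, the particular map `g_isi` (a binary intersymbol-interference channel with additive noise and memory $m = 1$), and the fixed input and noise sequences are hypothetical choices for illustration only; in the paper the noise is an arbitrary stationary ergodic process.

```python
def sliding_block_channel(x, z, m, g):
    """Return y with y[i] = g(x[i-m..i], z[i-m..i]) for i >= m.

    The first m outputs depend on unspecified initial conditions and are
    returned as None.
    """
    assert len(x) == len(z)
    y = [None] * len(x)
    for i in range(m, len(x)):
        y[i] = g(tuple(x[i - m:i + 1]), tuple(z[i - m:i + 1]))
    return y


def g_isi(x_window, z_window):
    """Hypothetical map: current input plus previous input plus current noise, mod 2."""
    return (x_window[-1] + x_window[-2] + z_window[-1]) % 2


x = [1, 0, 1, 1, 0]
z = [0, 0, 1, 0, 1]
y = sliding_block_channel(x, z, 1, g_isi)
```

The same map `g` is applied at every time index, which is the time-invariance required of (6).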

The channel model (6) is rather simple and physically motivated. Yet this channel model is general enough to include many important feedback communication models such as any additive noise fading channels with intersymbol interference and indecomposable finite-state channels without intersymbol interference.¹

The channel (6) has finite input memory in the sense of Feinstein [11] and can be viewed as a finite-window sliding-block coder [16, Section 9.4] of input and noise processes (cf. primitive channels introduced by Neuhoff and Shields [29] in which the noise process is memoryless). Compared to the general finite-state channel model (5) in which the channel has infinite input memory but the channel noise is memoryless, our channel model (6) has finite input memory but the noise has infinite memory; recall that there is no mixing condition on the noise process $\{Z_n\}_{n=1}^\infty$. Thus, the finite-state channel model and the finite sliding-block channel model nicely complement each other.

Our main result is to show that the feedback capacity $C_{\mathrm{FB}}$ of the channel (6) is characterized by (4). More precisely, we consider a communication problem depicted in Figure 1. Here one wishes
1 A notable exception is a famous finite-state channel called the "trapdoor channel" introduced by Blackwell [3], the feedback capacity of which is established in [30].
[Figure 1: Feedback communication channel $Y_i = g(X_{i-m}^i, Z_{i-m}^i)$. A message $W \in \{1, \ldots, 2^{nR}\}$ is encoded as $X_i(W, Y^{i-1})$, sent over the channel, and decoded as $\hat{W}_n(Y^n)$.]



to communicate a message index $W \in \{1, 2, \ldots, 2^{nR}\}$ over the channel

$$Y_i = \begin{cases} \emptyset, & i = 1, \ldots, m, \\ g(X_{i-m}^i, Z_{i-m}^i), & i = m+1, m+2, \ldots, \end{cases} \tag{7}$$

where the time-$i$ channel output $Y_i$ on the output alphabet $\mathcal{Y}$ is given by a deterministic map $g : \mathcal{X}^{m+1} \times \mathcal{Z}^{m+1} \to \mathcal{Y}$ of the current and past $m$ channel inputs $X_{i-m}^i$ on the input alphabet $\mathcal{X}$ and the current and past $m$ channel noises $Z_{i-m}^i$ on the noise alphabet $\mathcal{Z}$. We assume that the channel noise process $\{Z_i\}_{i=1}^\infty$ is stationary ergodic and is independent of the message $W$. The initial values of $Y_1, \ldots, Y_m$ are set arbitrarily. They depend on the unspecified initial condition $(X_{-m+1}^0, Z_{-m+1}^0)$, the effect of which vanishes from time $m + 1$. Thus the long term behavior of the channel is independent of $Y^m$.

We specify a $(2^{nR}, n)$ feedback code with the encoding maps

$$X^n(W, Y^{n-1}) = (X_1(W), X_2(W, Y_1), \ldots, X_n(W, Y^{n-1})), \qquad W = 1, \ldots, 2^{nR},$$

and the decoding map

$$\hat{W}_n : \mathcal{Y}^n \to \{1, \ldots, 2^{nR}\}.$$

The probability of error $P_e^{(n)}$ is defined as

$$P_e^{(n)} = \frac{1}{2^{nR}} \sum_{w=1}^{2^{nR}} \Pr\{\hat{W}_n(Y^n) \ne w \mid W = w\} = \Pr\{\hat{W}_n(Y^n) \ne W\},$$

where the message $W$ is uniformly distributed over $\{1, \ldots, 2^{nR}\}$ and is independent of $\{Z_i\}_{i=1}^\infty$. We say that the rate $R$ is achievable if there exists a sequence of $(2^{nR}, n)$ codes with $P_e^{(n)} \to 0$ as $n \to \infty$. The feedback capacity $C_{\mathrm{FB}}$ is defined as the supremum of all achievable rates. The nonfeedback capacity $C$ is defined similarly, with codewords $X^n(W) = (X_1(W), \ldots, X_n(W))$ restricted to be a function of the message $W$ only.

We will prove the following result in Section 4.

Theorem 1. The feedback capacity $C_{\mathrm{FB}}$ of the channel (7) is given by

$$C_{\mathrm{FB}} = \lim_{n\to\infty} \sup_{p(x^n \| y^{n-1})} \frac{1}{n} I(X^n \to Y^n). \tag{8}$$
Our development has two major ingredients. First, we revisit the communication problem over the same channel without feedback in Section 3 and prove that the nonfeedback capacity is given by

$$C = \lim_{n\to\infty} \sup_{p(x^n)} \frac{1}{n} I(X^n; Y^n).$$



Roughly speaking, there are three flavors in the literature for the achievability proof of nonfeedback capacity theorems. The first one is Shannon's original argument [34] based on random codebook generation, asymptotic equipartition property, and joint typicality decoding, which was made rigorous by Forney [13] and Cover [7], and now is used widely in coding theorems for memoryless networks [9, Chapter 15]. This approach, however, does not easily generalize to channels with memory. The second flavor is the method of random coding exponent by Gallager [15], which was later applied to finite-state channels [14, Section 5.9]. This approach is perhaps the simplest one for the analysis of general finite-state channels and has been adapted by Lapidoth and Telatar [25] for compound finite-state channels and by Permuter et al. [31] for finite-state channels with feedback.

The third and the least intuitive approach is Feinstein's fundamental lemma [12]. This is the most powerful and general method of the three, and has been applied extensively in the literature, say, from Khinchin [19] to Gray [16] to Verdu and Han [38].

Our approach is somewhat different from these three usual approaches. We use the strong typicality (relative frequency) decoding for $n$-dimensional super letters. A constructive coding scheme (up to the level of Shannon's random codebook generation) based on block ergodic decomposition of Nedoma [28] is developed, which uses a long codeword on the $n$-letter super alphabet, constructed as a concatenation of $n$ shorter codewords. While each short codeword and the corresponding output fall into their own ergodic mode, the long codeword as a whole maintains the ergodic behavior. To be fair, codebook construction of this type is far from new in the literature, and our method is intimately related to the one used by Gallager [14, Section 9.8] and Berger [2, Section 7.2] for lossy compression of stationary ergodic sources. Indeed, when the channel (6) has zero memory ($m = 0$), then the role of the input for our channel coding scheme is equivalent to the role of the covering channel for Gallager's source coding scheme.

Equipped with this coding method for nonfeedback sliding-block coder channels (6), the extension to the feedback case is relatively straightforward. The basic ingredient for this extension is the Shannon strategy for channels with causal side information at the transmitter [36]. As a matter of fact, Shannon himself observed that the major utility of his result is feedback communication. The following is the first sentence of [36]:

Channels with feedback from the receiving to the transmitting point are a special case 
of a situation in which there is additional information available at the transmitter which 
may be used as an aid in the forward transmission system. 

As observed by Caire and Shamai [5, Proposition 1], the causality has no cost when the transmitter and the receiver share the same side information (in our case, the past input, if decoded faithfully, and the past output, received from feedback), and the transmission can fully utilize this side information as if it were known a priori.

Intuitively speaking, we can achieve the rate $R_i$ for the $i$th symbol in the length-$n$ super symbol as

$$R_i = \max_{p(x_i \mid x^{i-1}, y^{i-1})} I(X_i; Y_i^n \mid X^{i-1}, Y^{i-1}), \qquad i = 1, \ldots, n,$$

and hence the total achievable rate becomes

$$R = \max_{p(x^n \| y^{n-1})} \sum_{i=1}^n I(X_i; Y_i^n \mid X^{i-1}, Y^{i-1})$$

per $n$ transmissions. Now simple algebra shows that this rate is equal to the maximum directed information as follows:

$$\begin{aligned}
\sum_{i=1}^n I(X_i; Y_i^n \mid X^{i-1}, Y^{i-1}) &= \sum_{i=1}^n \sum_{j=i}^n I(X_i; Y_j \mid X^{i-1}, Y^{j-1}) \\
&= \sum_{j=1}^n \sum_{i=1}^j I(X_i; Y_j \mid X^{i-1}, Y^{j-1}) \\
&= \sum_{j=1}^n I(X^j; Y_j \mid Y^{j-1}) \\
&= I(X^n \to Y^n).
\end{aligned} \tag{9}$$
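
For concreteness, the special case $n = 2$ of the expansion of the directed information can be written out term by term. This is only an illustrative instance; each step uses nothing beyond the chain rule for mutual information.

```latex
\begin{align*}
I(X_1; Y_1, Y_2) + I(X_2; Y_2 \mid X_1, Y_1)
  &= I(X_1; Y_1) + I(X_1; Y_2 \mid Y_1) + I(X_2; Y_2 \mid X_1, Y_1) \\
  &= I(X_1; Y_1) + I(X_1, X_2; Y_2 \mid Y_1) \\
  &= I(X^2 \to Y^2).
\end{align*}
```

The first line splits $I(X_1; Y_1, Y_2)$ over the two outputs, and the second line merges the two terms involving $Y_2$ into a single term over $(X_1, X_2)$.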

The above argument, while intuitively appealing, is not completely rigorous, however. Therefore, we will take more careful steps, by first proving the achievability of $\frac{1}{n} I(U^n; Y^n)$ for all auxiliary random variables $U^n$ and Shannon strategies $X_i(U_i, X^{i-1}, Y^{i-1})$, $i = 1, \ldots, n$, and then showing that $I(U^n; Y^n)$ reduces to $I(X^n \to Y^n)$ via pure algebra.

The next section collects all necessary lemmas that will be used subsequently in Section 3 for the nonfeedback coding theorem and in Section 4 for the feedback coding theorem.

2 Preliminaries 

Here we review relevant materials from ergodic theory and information theory in the form of 10 lemmas. While some of the lemmas are classical and are presented in order to make the paper self-contained, the other lemmas are crucial to our main discussion in subsequent sections and may contain original observations. Throughout this section, $Z = \{Z_i\}_{i=1}^\infty$ denotes a generic stochastic process on a finite alphabet $\mathcal{Z}$ with associated probability measure $P$ defined on Borel sets under the usual topology on $\mathcal{Z}^\infty$.

2.1 Ergodicity 

Given a stationary process $Z = \{Z_i\}_{i=1}^\infty$, let $T : \mathcal{Z}^\infty \to \mathcal{Z}^\infty$ be the associated measure preserving shift transformation. Intuitively, $T$ maps the infinite sequence $(z_1, z_2, z_3, \ldots)$ to $(z_2, z_3, z_4, \ldots)$. We say the transformation $T$ (or the process $Z$ itself) is ergodic if every measurable set $A$ with $TA = A$ satisfies either $P(A) = 0$ or $P(A) = 1$.

The following characterization of ergodicity is well known; see, for example, Petersen [32, Exercise 2.4.4] or Wolfowitz [39, Lemma 10.3.1].
Lemma 1. Suppose $\{Z_i\}_{i=1}^\infty$ is a stationary process and let $T$ denote the associated measure preserving shift transformation. Then, $\{Z_i\}$ is ergodic if and only if

$$\lim_{n\to\infty} \frac{1}{n} \sum_{i=0}^{n-1} P(T^{-i}A \cap B) = P(A) \cdot P(B) \quad \text{for all measurable } A \text{ and } B.$$

When $X = \{X_i\}_{i=1}^\infty$ and $Z = \{Z_i\}_{i=1}^\infty$ are independent stationary ergodic processes, they are not necessarily jointly ergodic. For example, if we take

$$X = \begin{cases} 01010101\ldots, & \text{with probability } 1/2, \\ 10101010\ldots, & \text{with probability } 1/2, \end{cases}$$

and $Z$ is independent and identically distributed as $X$, then it is easy to verify that $\{Y_i = X_i + Z_i \pmod 2\}_{i=1}^\infty$ is not ergodic. However, if one of the processes is mixing reasonably fast, then they are jointly ergodic. The following result states a sufficient condition for joint ergodicity.

Lemma 2. If $X$ is independent and identically distributed (i.i.d.), and $Z$ is stationary ergodic, independent of $X$, then the pair $(X, Z) = \{(X_i, Z_i)\}_{i=1}^\infty$ is jointly stationary ergodic.

A stronger result is true, which assumes $X$ to be weakly mixing only. The proof is an easy consequence of Lemma 1; for details refer to Brown [4, Proposition 1.6] or Wolfowitz [39, Theorem 10.3.1].
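
The alternating-sequence counterexample preceding Lemma 2 is easy to check numerically. The sketch below (helper names, path length, and the random seed are our own illustrative choices) samples the sum process $Y_i = X_i + Z_i \pmod 2$ and records the time average of each sample path.

```python
import random


def alternating(phase, length):
    """0101... if phase == 0, and 1010... if phase == 1."""
    return [(i + phase) % 2 for i in range(length)]


def sample_sum_process(length, rng):
    """One sample path of Y_i = X_i + Z_i (mod 2) with independent random phases."""
    x = alternating(rng.randrange(2), length)
    z = alternating(rng.randrange(2), length)
    return [(a + b) % 2 for a, b in zip(x, z)]


rng = random.Random(0)
paths = [sample_sum_process(1000, rng) for _ in range(200)]
time_averages = {sum(path) / len(path) for path in paths}
```

Every sample path is either all zeros or all ones, so the time average is 0 or 1 on each path and never the ensemble mean 1/2, which is the failure of ergodicity claimed in the text.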

We will later need to construct super-letter processes for our coding theorems. The next lemma, due to Gallager [14, Lemma 9.8.2], deals with the ergodic decomposition of the $n$-letter super process that is built from a single-letter stationary ergodic one; see also Nedoma [28] and Berger [2, Section 7.2].

Lemma 3. Suppose $Z = \{Z_i\}_{i=1}^\infty$ is stationary ergodic on $\mathcal{Z}$, and let $T$ be the associated shift transformation. Define the $n$th-order super process $Z^{(n)} = \{Z_i^{(n)}\}_{i=1}^\infty$ on $\mathcal{Z}^n$ as

$$Z_i^{(n)} = (Z_{n(i-1)+1}, Z_{n(i-1)+2}, \ldots, Z_{ni}), \qquad i = 1, 2, \ldots.$$

Then, the super process $Z^{(n)}$ has $n'$ ergodic modes, each with probability $1/n'$ and disjoint up to measure zero, where $n'$ divides $n$. Furthermore, in the space $\mathcal{Z}^\infty$ of the original process $Z$, the sets $S_1, S_2, \ldots, S_{n'}$ corresponding to these ergodic modes can be related by $T(S_i) = S_{i+1}$, $1 \le i \le n' - 1$, and $T(S_{n'}) = S_1$.

We will use the notation $P(\cdot \mid S_k)$, $k = 1, \ldots, n'$, for the probability measure under each ergodic mode.

2.2 Strong Typicality 

We use the strong typicality [9, Section 10.6] as the basic method of decoding. Here we review a few basic properties of strongly typical sequences.

First definitions. Let $N(a \mid x^n)$ denote the number of occurrences of the symbol $a$ in the sequence $x^n$. We say a sequence $x^n \in \mathcal{X}^n$ is $\epsilon$-strongly typical (or typical in short) with respect to a distribution $P(x)$ on $\mathcal{X}$ if

$$\left| \frac{1}{n} N(a \mid x^n) - P(a) \right| < \frac{\epsilon}{|\mathcal{X}|}$$

for all $a \in \mathcal{X}$ with $P(a) > 0$, and $N(a \mid x^n) = 0$ for all $a \in \mathcal{X}$ with $P(a) = 0$. Consistent with this definition, we say a pair of sequences $(x^n, y^n)$ are jointly $\epsilon$-strongly typical (or jointly typical in short) with respect to a distribution $P(x, y)$ on $\mathcal{X} \times \mathcal{Y}$ if

$$\left| \frac{1}{n} N(a, b \mid x^n, y^n) - P(a, b) \right| < \frac{\epsilon}{|\mathcal{X}||\mathcal{Y}|}$$

for all $(a, b) \in \mathcal{X} \times \mathcal{Y}$ with $P(a, b) > 0$, and $N(a, b \mid x^n, y^n) = 0$ for all $(a, b) \in \mathcal{X} \times \mathcal{Y}$ with $P(a, b) = 0$.

The set of strongly typical sequences $x^n \in \mathcal{X}^n$ with respect to $X \sim P(x)$ is denoted $A_\epsilon^{*(n)}(X)$. We similarly define a joint typical set $A_\epsilon^{*(n)}(X, Y)$ for $(X, Y) \sim P(x, y)$.
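
The definition of strong typicality translates directly into a membership test. The following is a minimal sketch; the function name `is_strongly_typical` is hypothetical, and we assume the pmf is given as a dictionary listing every symbol of the alphabet (including those of probability zero), so that `len(pmf)` plays the role of the alphabet size.

```python
from collections import Counter


def is_strongly_typical(seq, pmf, eps):
    """Check eps-strong typicality of `seq` with respect to `pmf`.

    `pmf` maps every alphabet symbol to its probability.
    """
    n = len(seq)
    counts = Counter(seq)
    # A symbol outside the listed alphabet is treated as having probability zero.
    if any(symbol not in pmf for symbol in counts):
        return False
    for symbol, prob in pmf.items():
        if prob == 0:
            if counts[symbol] != 0:
                return False
        elif abs(counts[symbol] / n - prob) >= eps / len(pmf):
            return False
    return True


pmf = {0: 0.5, 1: 0.5, 2: 0.0}
balanced = [0, 1] * 50
skewed = [0] * 70 + [1] * 30
forbidden = [0, 1] * 49 + [0, 2]
```

Joint typicality of $(x^n, y^n)$ can be tested with the same function by passing `list(zip(x, y))` together with a joint pmf keyed by pairs.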

The following statement is a trivial consequence of the definition of typical sequences. 

Lemma 4. Suppose $X \sim P(x)$. If $x^n \in A_\epsilon^{*(n)}(X)$ and $y^n = f(x^n) := (f(x_1), f(x_2), \ldots, f(x_n))$, then $y^n \in A_\delta^{*(n)}(f(X))$ with $\delta = \epsilon \cdot (|\mathcal{X}| - 1)$.

As a special case, if $(x^n, y^n)$ is $\epsilon$-strongly typical with respect to a joint distribution $P(x, y)$, then $x^n$ is $\epsilon$-strongly typical with respect to the marginal $P(x) = \sum_y P(x, y)$.

Our discussion on the typical sequences so far has not given a specific context on how they are generated. Now we connect the notion of strong typicality with ergodic processes. First, from Birkhoff's ergodic theorem [32, Theorem 2.2.3] and the definition of ergodicity, the following lemma is immediate.

Lemma 5. Let $Z = \{Z_i\}_{i=1}^\infty$ be stationary ergodic with $Z_1 \sim P(z)$. Then

$$\Pr(Z^n \in A_\epsilon^{*(n)}(Z_1)) \to 1 \quad \text{as } n \to \infty.$$

As we mentioned in the previous subsection, the $n$th order super process $Z^{(n)} = \{Z_i^{(n)}\}_{i=1}^\infty$ defined as

$$Z_i^{(n)} = Z_{n(i-1)+1}^{ni}, \qquad i = 1, 2, \ldots,$$

is not necessarily ergodic, but is a mixture of disjoint ergodic modes. Thus, the super process $Z^{(n)}$ is not necessarily typical with respect to $P(z^n)$ on the $n$-letter alphabet $\mathcal{Z}^n$. The following construction by Gallager [14, pp. 498-499], however, gives a typical sequence in the $n$-letter super alphabet by shifting through each ergodic phase.

Lemma 6. Given positive integers $n, L$ and a stationary ergodic process $Z = \{Z_i\}_{i=1}^\infty$, construct $\tilde{Z} = \{\tilde{Z}_i\}_{i=1}^{Ln^2}$ as follows (see Figure 2):

$$\tilde{Z}_i = \begin{cases} Z_i, & i = 1, \ldots, Ln, \\ Z_{i+1}, & i = Ln + 1, \ldots, 2Ln, \\ \;\vdots \\ Z_{i+n-1}, & i = Ln(n-1) + 1, \ldots, Ln^2. \end{cases}$$

In other words, $\{\tilde{Z}_i\}_{i=1}^{Ln^2}$ is a verbatim copy of $\{Z_i\}_{i=1}^{Ln^2+n}$ with every $(Ln + 1)$st position skipped.
[Figure 2: Construction of $\tilde{Z}^{Ln^2}$ from $Z^{Ln^2+n}$: $n = 3$, $L = 2$.]

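Let $\tilde{Z}^{(n)} = (\tilde{Z}_1^n, \tilde{Z}_{n+1}^{2n}, \ldots, \tilde{Z}_{Ln^2-n+1}^{Ln^2})$ be the associated $n$th order super process of length $Ln$. Then,

$$\Pr(\tilde{Z}^{(n)} \in A_\epsilon^{*(Ln)}(Z^n)) \to 1 \quad \text{as } L \to \infty.$$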



Proof. From Lemma 3 and the given construction of skipping one position after every $Ln$ symbols, each of $n$ sequences

$$\begin{aligned}
(\tilde{Z}_1^n, \ldots, \tilde{Z}_{Ln-n+1}^{Ln}) &= (Z_1^n, \ldots, Z_{Ln-n+1}^{Ln}), \\
(\tilde{Z}_{Ln+1}^{Ln+n}, \ldots, \tilde{Z}_{2Ln-n+1}^{2Ln}) &= (Z_{Ln+2}^{Ln+n+1}, \ldots, Z_{2Ln-n+2}^{2Ln+1}), \\
&\;\;\vdots \\
(\tilde{Z}_{Ln(n-1)+1}^{Ln(n-1)+n}, \ldots, \tilde{Z}_{Ln^2-n+1}^{Ln^2}) &= (Z_{Ln(n-1)+n}^{Ln(n-1)+2n-1}, \ldots, Z_{Ln^2}^{Ln^2+n-1})
\end{aligned}$$

falls in one of the ergodic modes $(S_1, \ldots, S_{n'})$ with $n/n'$ sequences for each mode. Now for each sequence with corresponding ergodic mode $S_k$, the relative frequencies of all super symbols $a^n \in \mathcal{Z}^n$ converge to the corresponding distribution $P(a^n \mid S_k)$ as $L \to \infty$. But each ergodic mode is visited evenly, each by $n/n'$ sequences. Therefore, the relative frequencies of all $a^n \in \mathcal{Z}^n$ in the entire sequence $\tilde{Z}^{Ln^2}$ converge to

$$\frac{1}{n'} \sum_{k=1}^{n'} P(a^n \mid S_k) = P(a^n)$$

as $L \to \infty$. □
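
The skipping construction of Lemma 6 can be sketched in a few lines. The helper names `skip_copy` and `super_letters` are hypothetical, indices are 0-based, and the deterministic alternating input is chosen only to make the effect on ergodic modes visible.

```python
from collections import Counter


def skip_copy(z, n, L):
    """Return Z-tilde of length L*n*n: copy z, dropping one symbol after every L*n kept."""
    block = L * n
    assert len(z) >= block * n + n - 1
    # Position i lies in block number i // block and is shifted by that many skips.
    return [z[i + i // block] for i in range(block * n)]


def super_letters(seq, n):
    """Group a sequence into consecutive non-overlapping n-tuples."""
    return [tuple(seq[j:j + n]) for j in range(0, len(seq) - n + 1, n)]


# The setting of Figure 2 (n = 3, L = 2), using the indices themselves as symbols.
z_tilde = skip_copy(list(range(1, 22)), 3, 2)

# Alternating sequence 0101...: its 2-letter super process is stuck in one mode,
# but after skipping, both modes (0, 1) and (1, 0) appear equally often.
alternating = [i % 2 for i in range(14)]
mode_counts = Counter(super_letters(skip_copy(alternating, 2, 3), 2))
```

With the alternating input, the first block of super letters sits in one ergodic mode and the second block in the other, which is the even visiting of modes used in the proof.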

Combining Lemma 2 with the proof of Lemma 6, we have the following result.

Lemma 7. Under the condition of Lemma 6, let further $X = \{X_i\}_{i=1}^\infty$ be blockwise i.i.d. $\sim P(x^n)$, that is, $X_i^{(n)} = X_{n(i-1)+1}^{ni}$, $i = 1, 2, \ldots$, i.i.d. $\sim P(x^n)$, independent of $Z$. Then,

$$\Pr((X^{(n)}, \tilde{Z}^{(n)}) \in A_\epsilon^{*(Ln)}(X^n, Z^n)) \to 1 \quad \text{as } L \to \infty.$$

Finally we recall the key result linking the typicality with mutual information [9, Lemma 10.6.2].

Lemma 8. Suppose $(X, Y) \sim P(x, y)$ and let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim P(x)$. For $y^n \in A_\epsilon^{*(n)}(Y)$, the probability that $(X^n, y^n) \in A_\epsilon^{*(n)}(X, Y)$ is upper bounded by

$$\Pr((X^n, y^n) \in A_\epsilon^{*(n)}(X, Y)) \le 2^{-n(I(X;Y) - \delta)}$$

where $\delta \to 0$ as $\epsilon \to 0$.
2.3 Channels with Side Information

We prove the following identity in a purely algebraic manner, then find its meaning in information theory. Here we assume every alphabet is finite.

Lemma 9. Suppose $S \sim p(s)$. For a given conditional distribution $p(y \mid x, s)$ on the product space $\mathcal{X} \times \mathcal{S} \times \mathcal{Y}$, we have

$$\max_{p(u),\, x = f(u,s)} I(U; Y, S) = \max_{p(x \mid s)} I(X; Y \mid S), \tag{10}$$

where the maximum on the left hand side is taken over all conditional distributions of the form $p(u, x \mid s) = p(u)\, p(x \mid u, s)$ with deterministic $p(x \mid u, s)$ (that is, $p(x \mid u, s) = 0$ or $1$), and the auxiliary random variable $U$ has cardinality bounded by $|\mathcal{U}| \le (|\mathcal{X}| - 1)|\mathcal{S}|$.

Proof. For any joint distribution of the form $p(u, x, s, y) = p(u)\, p(s)\, p(x \mid u, s)\, p(y \mid x, s)$ with deterministic $p(x \mid u, s)$, we have the following Markov chains: $U \to (X, S) \to Y$ and $X \to (U, S) \to Y$. Combined with the independence of $U$ and $S$, these Markov relationships imply that

$$\begin{aligned}
\max_{p(u),\, x = f(u,s)} I(U; Y, S) &= \max_{p(u),\, x = f(u,s)} I(U; Y \mid S) \\
&= \max_{p(u),\, x = f(u,s)} I(X; Y \mid S).
\end{aligned} \tag{11}$$

But it can be easily verified that any conditional distribution $p(x \mid s)$ can be represented as

$$p(x \mid s) = \sum_u p(u)\, p(x \mid u, s)$$

for appropriately chosen $p(u)$ and deterministic $p(x \mid u, s)$ with cardinality of $U$ upper bounded by $|\mathcal{U}| \le (|\mathcal{X}| - 1)|\mathcal{S}|$. Therefore, we have

$$\max_{p(u),\, x = f(u,s)} I(X; Y \mid S) = \max_{p(x \mid s)} I(X; Y \mid S),$$

which proves the desired result. □
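
The representation step in the proof of Lemma 9 can be illustrated with one explicit (and deliberately wasteful) choice of $p(u)$. In the sketch below, $U$ ranges over all $|\mathcal{X}|^{|\mathcal{S}|}$ maps $f : \mathcal{S} \to \mathcal{X}$ with the product weights $p(u = f) = \prod_s p(f(s) \mid s)$. This is our own assumption for illustration and makes no attempt to meet the smaller cardinality bound of the lemma; the function names and the numerical values are hypothetical.

```python
import itertools


def shannon_mixture(p_x_given_s, xs, ss):
    """Return {f: p(u = f)} over all maps f: S -> X, encoded as tuples aligned with ss."""
    mixture = {}
    for f in itertools.product(xs, repeat=len(ss)):
        prob = 1.0
        for s, x in zip(ss, f):
            prob *= p_x_given_s[s][x]
        mixture[f] = prob
    return mixture


def induced_conditional(mixture, xs, ss):
    """Compute sum_u p(u) 1{f_u(s) = x}, the conditional distribution the mixture induces."""
    out = {s: {x: 0.0 for x in xs} for s in ss}
    for f, prob in mixture.items():
        for s, x in zip(ss, f):
            out[s][x] += prob
    return out


xs, ss = (0, 1), (0, 1)
target = {0: {0: 0.3, 1: 0.7}, 1: {0: 0.9, 1: 0.1}}
mixture = shannon_mixture(target, xs, ss)
recovered = induced_conditional(mixture, xs, ss)
```

Here `recovered` agrees with `target` up to floating-point error, so a randomized input $p(x \mid s)$ is reproduced by drawing a deterministic map at random, independently of the state.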
It is well known that the capacity of a memoryless state-dependent channel $p(y \mid x, s)$ is given as

$$C = \max_{p(x \mid s)} I(X; Y \mid S),$$

if the state information is known at both the encoder and decoder prior to the actual communication. What will happen if the transmitter learns the state information on the fly, so that only the past and present state realization can be utilized for communication?

Shannon [36] considered the communication over a memoryless state-dependent channel $p(y \mid x, s)$ with state information available only at the transmitter on the fly, and showed that the capacity is given by

$$C = \max_{p(u),\, x = f(u,s)} I(U; Y), \tag{12}$$

where the cardinality of $U$ is bounded as $|\mathcal{U}| \le |\mathcal{X}|^{|\mathcal{S}|}$, counting for all functions $f : \mathcal{S} \to \mathcal{X}$. This capacity is achieved by attaching a physical device $X = f(U, S)$ in front of the actual channel as
[Figure 3: Shannon strategy for coding with side information. The input $U$ enters the device $f(U, S)$, whose output $X$ drives the channel $p(y \mid x, s)$ producing $Y$.]

depicted in Figure 3, which maps the channel state $S$ to the channel input $X$ according to the function (index) $U$. Now treating $U$ as the input to the newly generated channel

$$p(y \mid u) = \sum_{x, s} p(s)\, p(x \mid u, s)\, p(y \mid x, s)$$

and coding as in the case of usual memoryless channels, we can easily achieve $I(U; Y)$. This method, surprisingly simple yet optimal, is sometimes called the Shannon strategy.
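
The induced channel $p(y \mid u)$ is straightforward to compute once the strategies are enumerated. The sketch below uses a hypothetical toy channel $Y = X \oplus S$ with a uniform binary state; the function name `induced_channel` and the indexing of strategies by their position in the enumeration are our own conventions.

```python
import itertools


def induced_channel(p_s, p_y_given_xs, strategies, ss, ys):
    """p(y | u) = sum_s p(s) p(y | f_u(s), s), where f_u is a tuple aligned with ss."""
    channel = {}
    for u, f in enumerate(strategies):
        channel[u] = {
            y: sum(p_s[s] * p_y_given_xs[(f[k], s)][y] for k, s in enumerate(ss))
            for y in ys
        }
    return channel


xs, ss, ys = (0, 1), (0, 1), (0, 1)
p_s = {0: 0.5, 1: 0.5}
# Toy channel: the output is the input XOR the state, with no further noise.
p_y_given_xs = {
    (x, s): {y: 1.0 if y == (x ^ s) else 0.0 for y in ys} for x in xs for s in ss
}
strategies = list(itertools.product(xs, repeat=len(ss)))  # all maps S -> X
channel = induced_channel(p_s, p_y_given_xs, strategies, ss, ys)
```

In this toy case the strategies `(0, 1)` and `(1, 0)` cancel the state exactly, so using only these two as inputs turns the induced channel into a noiseless binary channel, whereas the constant strategies leave the output uniformly random.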

Now when the decoder also knows the channel state $S$, it is equivalent for the decoder to receive the augmented channel output $Y' = (Y, S)$. Thus, the capacity of the same channel $p(y \mid x, s)$ with the state information causally known at both the encoder and decoder² follows from (12) as

$$C = \max_{p(u),\, x = f(u,s)} I(U; Y, S).$$

Therefore, Lemma 9 states that when the same side information is available at the receiver, the causal encoder with the best Shannon strategy performs no worse than the noncausal encoder who can preselect the entire codeword compatible with the whole state sequence.

For the last lemma needed for main results, we recall the notation of causally conditioned distributions

$$p(x^n \| y^{n-1}) = \prod_{i=1}^n p(x_i \mid x^{i-1}, y^{i-1}) \tag{13}$$

and

$$p(y^n \| x^n) = \prod_{i=1}^n p(y_i \mid x^i, y^{i-1}). \tag{14}$$

(The notation (13) and (14) can be unified if we define

$$p(a^n \| b^m) = \begin{cases} \prod_{i=1}^n p(a_i \mid a^{i-1}, b^i), & n = m, \\ p(a^n \| \emptyset^{n-m} b^m), & n > m, \\ p(\emptyset^{m-n} a^n \| b^m), & n < m.) \end{cases}$$



2 For the usual block coding, the decoder causality is irrelevant. The message is decoded only after the entire block 
is received. 
By chain rule, we have

$$p(x^n \| y^{n-1})\, p(y^n \| x^n) = p(x^n, y^n) = p(x^n)\, p(y^n \mid x^n)$$

for any joint distribution $p(x^n, y^n)$. Thus, given a causally conditioned distribution (or a channel) $p(y^n \| x^n)$, the causally conditioned distribution (or the input) $p(x^n \| y^{n-1})$ completely specifies the joint distribution $p(x^n, y^n)$.
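
The chain-rule factorization of the joint distribution into the two causally conditioned distributions can be verified numerically on a small example. In the sketch below the joint pmf is random and strictly positive, with $n = 2$ and binary alphabets; all function names and these parameter choices are our own assumptions for illustration.

```python
import itertools
import random


def prob(joint, n, x_prefix, y_prefix):
    """P(X^a = x_prefix, Y^b = y_prefix) for a pmf over tuples (x_1..x_n, y_1..y_n)."""
    a, b = len(x_prefix), len(y_prefix)
    return sum(
        p for o, p in joint.items() if o[:a] == x_prefix and o[n:n + b] == y_prefix
    )


def causal_input(joint, n, x, y):
    """p(x^n || y^{n-1}) = prod_i p(x_i | x^{i-1}, y^{i-1})."""
    out = 1.0
    for i in range(1, n + 1):
        out *= prob(joint, n, x[:i], y[:i - 1]) / prob(joint, n, x[:i - 1], y[:i - 1])
    return out


def causal_channel(joint, n, x, y):
    """p(y^n || x^n) = prod_i p(y_i | x^i, y^{i-1})."""
    out = 1.0
    for i in range(1, n + 1):
        out *= prob(joint, n, x[:i], y[:i]) / prob(joint, n, x[:i], y[:i - 1])
    return out


n = 2
rng = random.Random(1)
weights = {o: rng.random() + 0.01 for o in itertools.product((0, 1), repeat=2 * n)}
total = sum(weights.values())
joint = {o: w / total for o, w in weights.items()}

max_gap = max(
    abs(causal_input(joint, n, o[:n], o[n:]) * causal_channel(joint, n, o[:n], o[n:]) - p)
    for o, p in joint.items()
)
```

The product telescopes term by term, so `max_gap` is zero up to floating-point rounding.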

As a corollary of Lemma 9, we have the following result.

Lemma 10. Suppose a causally conditioned distribution $p(y^n \| x^n)$ is given. Then we have

$$\max_{p(u^n),\, x_i = f(u_i, x^{i-1}, y^{i-1})} I(U^n; Y^n) = \max_{p(x^n \| y^{n-1})} I(X^n \to Y^n), \tag{15}$$

where the maximum on the left hand side is taken over all joint distributions of the form

$$\begin{aligned}
p(u^n, x^n, y^n) &= \prod_{i=1}^n \Big( p(u_i)\, p(x_i \mid u_i, x^{i-1}, y^{i-1})\, p(y_i \mid x^i, y^{i-1}) \Big) \\
&= \Big( \prod_{i=1}^n p(u_i)\, p(x_i \mid u_i, x^{i-1}, y^{i-1}) \Big)\, p(y^n \| x^n)
\end{aligned} \tag{16}$$

with deterministic $p(x_i \mid u_i, x^{i-1}, y^{i-1})$, $i = 1, \ldots, n$, and the auxiliary random variable $U_i$ has cardinality bounded by $|\mathcal{U}_i| \le |\mathcal{X}|^i |\mathcal{Y}|^{i-1}$.

Proof. Let $q(u^n, x^n, y^n)$ be any joint distribution of the form (16) such that $q(x_i \mid u_i, x^{i-1}, y^{i-1})$, $i = 1, \ldots, n$, are deterministic and that $q(y^n \| x^n) = p(y^n \| x^n)$ (i.e., the joint distribution $q(u^n, x^n, y^n)$ is consistent with the given causally conditioned distribution $p(y^n \| x^n)$). For $(U^n, X^n, Y^n) \sim q(u^n, x^n, y^n)$, it is easy to verify that $U_i$ is independent of $(U^{i-1}, X^{i-1}, Y^{i-1})$, which implies that $U^{i-1} \to (X^{i-1}, Y^{i-1}) \to Y_i^n$ forms a Markov chain. On the other hand, $X^{i-1}$ is a deterministic function of $(U^{i-1}, Y^{i-1})$ and thus $X^{i-1} \to (U^{i-1}, Y^{i-1}) \to Y_i^n$ also forms a Markov chain. Similarly, we have the Markovity for $U^i \to (X^i, Y^{i-1}) \to Y_i^n$ and $X^i \to (U^i, Y^{i-1}) \to Y_i^n$. Therefore, we have

$$\begin{aligned}
I(U_i; Y^n \mid U^{i-1}) &= I(U_i; Y_i^n \mid Y^{i-1}, U^{i-1}) &\quad (17) \\
&= H(Y_i^n \mid Y^{i-1}, U^{i-1}) - H(Y_i^n \mid Y^{i-1}, U^i) \\
&= H(Y_i^n \mid Y^{i-1}, X^{i-1}) - H(Y_i^n \mid Y^{i-1}, X^i) &\quad (18) \\
&= I(X_i; Y_i^n \mid X^{i-1}, Y^{i-1}),
\end{aligned}$$

where (17) follows from the independence of $U_i$ and $(U^{i-1}, Y^{i-1})$, and (18) follows from the Markov relationships observed above. Now from the alternative expansion of the directed information shown in (9), we have

$$\max_q I(U^n; Y^n) = \max_q I(X^n \to Y^n).$$

Finally, by using distributions of the form

$$p(x_i \mid x^{i-1}, y^{i-1}) = \sum_{u_i} p(u_i)\, p(x_i \mid u_i, x^{i-1}, y^{i-1}), \qquad i = 1, \ldots, n,$$

with appropriately chosen $p(u_i)$ and deterministic $p(x_i \mid u_i, x^{i-1}, y^{i-1})$, we can represent any causally conditioned distribution

$$p(x^n \| y^{n-1}) = \prod_{i=1}^n p(x_i \mid x^{i-1}, y^{i-1}) = \sum_{u^n} \prod_{i=1}^n p(u_i)\, p(x_i \mid u_i, x^{i-1}, y^{i-1}),$$

which implies that

$$\max_q I(X^n \to Y^n) = \max_{p(x^n \| y^{n-1})} I(X^n \to Y^n)$$

and completes the proof. □

3 Nonfeedback Coding Theorem Revisited 

This section is devoted to the proof of the following result. 
Theorem 2. The nonfeedback capacity $C$ of the stationary channel

$$Y_i = \begin{cases} \emptyset, & i = 1, \ldots, m, \\ g(X_{i-m}^i, Z_{i-m}^i), & i = m+1, m+2, \ldots, \end{cases} \tag{19}$$

with the input $X_i$ and the stationary ergodic noise process $\{Z_i\}_{i=1}^\infty$ depicted in Figure 1 is given by

$$C = \lim_{n\to\infty} C_n = \lim_{n\to\infty} \sup_{p(x^n)} \frac{1}{n} I(X^n; Y^n). \tag{20}$$

Revisiting and proving the nonfeedback coding theorem is rewarding for two reasons. First, our proof is somewhat different from the usual techniques and hence is interesting on its own. (See Section 1 for the discussion on conventional achievability proofs of nonfeedback capacity theorems.) Second, our exercise here will lead to a straightforward proof of the feedback coding theorem in the next section.

Proof. We first note that the capacity expression (20) is well-defined because $nC_n$ is superadditive (i.e., $mC_m + nC_n \le (m + n)C_{m+n}$), which implies that the limit exists and

$$\lim_{n\to\infty} C_n = \sup_{n \ge 1} C_n.$$

The converse follows immediately from Fano's inequality [9, Lemma 7.9.1]. For any sequence of $(2^{nR}, n)$ codes $(X^n(W), \hat{W}(Y^n))$ with the message $W$ drawn uniformly over $\{1, \ldots, 2^{nR}\}$, if

$$P_e^{(n)} = \Pr(\hat{W} \ne W) \to 0,$$

then we must have

$$nR \le I(W; Y^n) + n\epsilon_n \le I(X^n; Y^n) + n\epsilon_n,$$



[Figure 4: Input, noise, and output sequences: $n = 3$, $L = 2$, $m = 1$. The rows show $X$, $Z$, $Y$ with $Y_i = g(X_{i-m}^i, Z_{i-m}^i)$, and the super-letter sequences $\tilde{X}^{(n)}$, $\tilde{Z}^{(n)}$, $\tilde{Y}^{(n)}$, where the super letters of $\tilde{X}^{(n)}$ are i.i.d. $\sim p^*(x^n)$.]

where $\epsilon_n \to 0$ as $n \to \infty$.

For the achievability, it suffices to show that there exists a sequence of codes that achieves $C_n$ for each $n > m$. (Recall $C_1 = \cdots = C_m = 0$.) Without loss of generality, we assume that the alphabets $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$ are finite. Otherwise, we can partition the space for each $n$ and $\epsilon > 0$ such that

$$\max_{p([x]^n)} \frac{1}{n} I([X]^n; [Y]^n) > C_n - \epsilon,$$

and prove the achievability on this partitioned space.

Codebook generation. Fix $n > m$ and let $p^*(x^n)$ denote the input distribution that achieves $C_n$. 
For each $L = 1, 2, \ldots$, let $k = k(L, n) = Ln^2 + n$. We generate a sequence$^1$ of $(2^{kR}, k)$ codes $X^k(w)$ 
as depicted in Figure 4. 

For each $w \in \{1, 2, \ldots, 2^{kR}\}$, generate a codeword $\tilde{\mathbf{X}}^{(n)}(w) = \tilde{X}^{Ln^2}(w)$ of length $Ln$ on the 
$n$-letter super alphabet $\mathcal{X}^n$ independently according to 

\[ p(\tilde{x}^{Ln^2}) = \prod_{i=1}^{Ln} p^*(\tilde{x}_{(i-1)n+1}^{in}). \]

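We exhibit the $2^{kR}$ codewords as the rows of a matrix: 

\[ \mathcal{C} = \begin{bmatrix}
\tilde{X}_1^n(1) & \tilde{X}_{n+1}^{2n}(1) & \cdots & \tilde{X}_{(Ln-1)n+1}^{Ln^2}(1) \\
\vdots & \vdots & \ddots & \vdots \\
\tilde{X}_1^n(2^{kR}) & \tilde{X}_{n+1}^{2n}(2^{kR}) & \cdots & \tilde{X}_{(Ln-1)n+1}^{Ln^2}(2^{kR})
\end{bmatrix}. \]

Each entry in this matrix is generated i.i.d. according to $p^*(x^n)$. 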



$^1$ This gives only a subsequence of $(2^{kR}, k)$ codes. But we can easily interpolate to $Ln^2 + n < k < (L+1)n^2 + n$ 
without any rate loss, since $(Ln^2 + n)/((L+1)n^2 + n) \to 1$ as $L \to \infty$. 

Using the construction as in Lemma 6 (see Figure 4), the actual codewords $\mathbf{X}(w) = X^k(w)$, $w = 
1, 2, \ldots, 2^{kR}$, which will be transmitted over the channel, are generated from $\tilde{\mathbf{X}}^{(n)}(w) = \tilde{X}^{Ln^2}(w)$ 
as follows: 

\[ X_{(i-1)Ln+i}^{iLn+i}(w) = \big( \tilde{X}_{(i-1)Ln+1}^{iLn}(w), \emptyset \big), \quad i = 1, 2, \ldots, n. \]

In other words, $X^k$ is a verbatim copy of $\tilde{X}^{Ln^2}$ with a fixed symbol separating the subsequences of 
length $Ln$. 
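
The construction of the transmitted codeword from the super-alphabet codeword can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name `insert_separators` and the `fixed` placeholder are hypothetical, and the fixed symbol is placed at the end of each run of $Ln$ symbols, which matches the index pattern of Figure 4 (positions 7, 14, 21 for n = 3, L = 2).

```python
def insert_separators(x_tilde, n, L, fixed=None):
    """Expand a length L*n*n sequence to length L*n*n + n.

    A copy of `fixed` is appended after every run of L*n symbols
    (placing it at the end of each run is an assumption of this sketch).
    """
    run = L * n
    if len(x_tilde) != run * n:
        raise ValueError("expected a sequence of length L * n**2")
    out = []
    for i in range(n):
        out.extend(x_tilde[i * run:(i + 1) * run])  # verbatim copy of one run
        out.append(fixed)                           # fixed separating symbol
    return out
```

With n = 3 and L = 2, an 18-symbol sequence becomes a 21-symbol one, so the blocklength is $k = Ln^2 + n$.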

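Encoding. If $W = w$, the transmitter sends the codeword $\mathbf{X}(w) = X^k(w)$ over the channel. 

Decoding. Upon receiving the sequence $\mathbf{Y} = Y^k$, the receiver forms the sequence $\tilde{\mathbf{Y}}^{(n)} = \tilde{Y}^{Ln^2}$ 
of length $Ln$ in the $n$-letter super alphabet $\mathcal{Y}^n$, as depicted in Figure 4: 

\[ \tilde{Y}_{(i-1)n+1}^{in} = \begin{cases}
(\emptyset, Y_{(i-1)n+m+1}^{in}), & i = 1, \ldots, L, \\
(\emptyset, Y_{(i-1)n+m+2}^{in+1}), & i = L+1, \ldots, 2L, \\
\quad \vdots \\
(\emptyset, Y_{(i-1)n+m+n}^{in+n-1}), & i = L(n-1)+1, \ldots, Ln.
\end{cases} \]

Now we consider $\tilde{\mathbf{X}}^{(n)} = \tilde{X}^{Ln^2}$ and $\tilde{\mathbf{Y}}^{(n)} = \tilde{Y}^{Ln^2}$ as sequences of length $Ln$ on the super 
alphabet $\mathcal{X}^n \times \mathcal{Y}^n$. The receiver declares that the message $\hat{W}$ was sent if there is a unique $\hat{W}$ such 
that 

\[ (\tilde{\mathbf{X}}^{(n)}(\hat{W}), \tilde{\mathbf{Y}}^{(n)}) \in A_\epsilon^{*(Ln)}(X^n, Y^n), \]

that is, $(\tilde{\mathbf{X}}^{(n)}(\hat{W}), \tilde{\mathbf{Y}}^{(n)})$ is jointly typical with respect to the joint distribution $p(x^n, y^n)$ specified 
by $p^*(x^n)p(z^n)$ and the definition of the channel (19). Otherwise, an error is declared. 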
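
The way the receiver regroups the received symbols into super letters with phase shifts can be sketched as below. This is a hedged illustration, not the paper's own procedure: `form_super_output` and `ERASED` are hypothetical names, indices are 1-based as in the text, and each block is padded with m placeholder symbols so that it has exactly n entries (an assumption about how the empty output is represented).

```python
ERASED = None  # hypothetical placeholder for the empty output symbol

def form_super_output(y, n, L, m):
    """Return the L*n super-letter blocks built from y = (Y_1, ..., Y_k), k = L*n*n + n.

    Block i (1-indexed) lies in phase group j = ceil(i / L) and consists of
    m placeholders followed by Y_{(i-1)n+m+j}, ..., Y_{in+j-1}.
    """
    k = L * n * n + n
    if len(y) != k:
        raise ValueError("expected k = L * n**2 + n received symbols")
    blocks = []
    for i in range(1, L * n + 1):
        j = (i - 1) // L + 1            # phase group, 1..n
        start = (i - 1) * n + m + j     # first kept index (1-indexed)
        stop = i * n + j - 1            # last kept index (1-indexed)
        blocks.append((ERASED,) * m + tuple(y[start - 1:stop]))
    return blocks
```

Feeding in the index sequence 1, ..., 21 with n = 3, L = 2, m = 1 shows which received symbols each super letter keeps; the shift grows by one after every L blocks because of the separating symbol in the transmitted sequence.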

Analysis of the probability of error. Without loss of generality, we assume $W = 1$ was sent. We 
define the following events: 

\[ E_i = \{ (\tilde{\mathbf{X}}^{(n)}(i), \tilde{\mathbf{Y}}^{(n)}) \in A_\epsilon^{*(Ln)}(X^n, Y^n) \}, \quad i \in \{1, 2, \ldots, 2^{kR}\}, \]

where $E_i$ is the event that the $i$th codeword and $\tilde{\mathbf{Y}}^{(n)}$ are jointly typical. By Bonferroni's inequality, 
we have 

\[ \begin{aligned}
\Pr(\hat{W} \ne W) &= \Pr(\hat{W} \ne W \mid W = 1) \\
&= \Pr(E_1^c \cup E_2 \cup E_3 \cup \cdots \cup E_{2^{kR}}) \\
&\le \Pr(E_1^c) + \sum_{i=2}^{2^{kR}} \Pr(E_i).
\end{aligned} \]

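In order to bound $\Pr(E_1^c)$, we define $\tilde{\mathbf{Z}}^{(n)}$ as the $n$th order super process of length $Ln$ on the 
super alphabet $\mathcal{Z}^n$ constructed from the noise process $\{Z_i\}_{i=1}^{\infty}$ as in Lemma 6. (See Figure 4.) 
Since $\tilde{\mathbf{X}}^{(n)}(1)$ is blockwise i.i.d. $\sim p^*(x^n)$ and independent of $\mathbf{Z}$, we have from Lemma 7 

\[ \Pr\big( (\tilde{\mathbf{X}}^{(n)}(1), \tilde{\mathbf{Z}}^{(n)}) \in A_\epsilon^{*(Ln)}(X^n, Z^n) \big) \to 1 \quad \text{as } L \to \infty. \]

Furthermore, $\tilde{\mathbf{Y}}^{(n)}$ is the blockwise function of $(\tilde{\mathbf{X}}^{(n)}(1), \tilde{\mathbf{Z}}^{(n)})$, that is, 

\[ \tilde{Y}_{(i-1)n+1}^{in} = f\big( \tilde{X}_{(i-1)n+1}^{in}(1), \tilde{Z}_{(i-1)n+1}^{in} \big), \]

with the time-invariant function $f$ induced by the channel function $g$ in (19). Thus by Lemma U 

\[ \Pr\big( (\tilde{\mathbf{X}}^{(n)}(1), \tilde{\mathbf{Y}}^{(n)}) \in A_\epsilon^{*(Ln)}(X^n, Y^n) \big) \to 1 \quad \text{as } L \to \infty, \]

and 

\[ \Pr(E_1^c) \le \epsilon \quad \text{for } L \text{ sufficiently large.} \]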

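On the other hand, recall that the typicality of $(\tilde{\mathbf{X}}^{(n)}(i), \tilde{\mathbf{Y}}^{(n)})$ implies the typicality of $\tilde{\mathbf{Y}}^{(n)}$ 
(Lemma 3). Hence, by Lemma 8 we have for each $i \ne 1$ 

\[ \begin{aligned}
\Pr(E_i) &= \Pr\big( (\tilde{\mathbf{X}}^{(n)}(i), \tilde{\mathbf{Y}}^{(n)}) \in A_\epsilon^{*(Ln)} \big) \\
&= \sum_{\mathbf{y}^{(n)} \in A_\epsilon^{*(Ln)}} p(\mathbf{y}^{(n)}) \Pr\big( (\tilde{\mathbf{X}}^{(n)}(i), \mathbf{y}^{(n)}) \in A_\epsilon^{*(Ln)} \big) \\
&\le 2^{-Ln(I(X^n; Y_{m+1}^n) - \delta)},
\end{aligned} \]

where $\delta \to 0$ as $\epsilon \to 0$. Consequently, 

\[ \Pr(\hat{W} \ne W) \le \Pr(E_1^c) + \sum_{i=2}^{2^{kR}} \Pr(E_i) \le 2\epsilon \]

if $L$ is sufficiently large and 

\[ kR < Ln\big( I(X^n; Y_{m+1}^n) - \delta \big), \]

or equivalently, 

\[ R < \frac{Ln}{Ln^2 + n} \big( I(X^n; Y_{m+1}^n) - \delta \big). \]

Since $\epsilon$ can be made arbitrarily small and $Ln/(Ln^2 + n) \to 1/n$ as $L \to \infty$, we have a sequence 
of $(2^{kR}, k)$ codes that achieves 

\[ R < \frac{1}{n} I(X^n; Y_{m+1}^n) = \frac{1}{n} I(X^n; Y^n) = C_n. \]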

□ 
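
The rate normalization at the end of the proof can be checked with exact arithmetic. The sketch below computes the ratio of the number of super letters $Ln$ to the blocklength $k = Ln^2 + n$; the function name `rate_factor` is a hypothetical helper introduced only for this illustration.

```python
from fractions import Fraction

def rate_factor(n: int, L: int) -> Fraction:
    """Return Ln / (Ln^2 + n), the number of super letters per channel use."""
    return Fraction(L * n, L * n * n + n)
```

Since $Ln/(Ln^2 + n) = L/(Ln + 1)$, the factor is always slightly below $1/n$ and the gap $1/(n(Ln+1))$ vanishes as $L$ grows, which is why the separating symbols cost no rate in the limit.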

4 Proof of Theorem 1 

Recall our channel model: 

\[ Y_i = \begin{cases}
\emptyset, & i = 1, \ldots, m, \\
g(X_{i-m}^i, Z_{i-m}^i), & i = m+1, m+2, \ldots,
\end{cases} \qquad (21) \]

with the input $X_i$ and the stationary ergodic noise process $\{Z_i\}_{i=1}^{\infty}$ depicted in Figure 1. We prove 
that the feedback capacity is given by 



\[ C_{FB} = \lim_{n \to \infty} C_{FB,n} = \lim_{n \to \infty} \sup_{p(x^n \| y^{n-1})} \frac{1}{n} I(X^n \to Y^n), \qquad (22) \]

where the supremum is over all causally conditioned distributions 

\[ p(x^n \| y^{n-1}) = \prod_{i=1}^{n} p(x_i \mid x^{i-1}, y^{i-1}). \]

We will combine the coding technique developed in the previous section with the Shannon strategy 
for channels with side information, in particular, Lemma 10. 

That the limit in (22) is well-defined follows from the superadditivity of $nC_{FB,n}$. Thus, 

\[ C_{FB} = \lim_{n \to \infty} C_{FB,n} = \sup_{n \ge 1} C_{FB,n}. \]

The converse was proved by Massey [26, Theorem 3]. We repeat the proof here for completeness. 
For any sequence of $(2^{nR}, n)$ codes with $P_e^{(n)} \to 0$, we have from Fano's inequality 

\[ \begin{aligned}
nR &\le I(W; Y^n) + n\epsilon_n \\
&= \sum_{i=1}^{n} I(W; Y_i \mid Y^{i-1}) + n\epsilon_n \\
&\le \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1}) + n\epsilon_n \qquad (23) \\
&= I(X^n \to Y^n) + n\epsilon_n,
\end{aligned} \]

where $\epsilon_n \to 0$ as $n \to \infty$. Here (23) follows from the codebook structure $X_i(W, Y^{i-1})$ and the 
Markovity $W \to (X^i, Y^{i-1}) \to Y_i$. 
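
To make the directed information in the last line concrete, the following Python sketch evaluates $I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1})$ for a small joint distribution. The representation of the joint pmf as a dictionary keyed by tuples $(x_1, \ldots, x_n, y_1, \ldots, y_n)$ and all function names are assumptions made for this illustration.

```python
from collections import defaultdict
from math import log2

def marginal(pmf, keep):
    """Marginalize a joint pmf (dict: outcome tuple -> prob) onto coordinates `keep`."""
    out = defaultdict(float)
    for outcome, p in pmf.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out

def cond_mutual_info(pmf, a, b, c):
    """I(A; B | C) in bits; a, b, c are lists of coordinate indices."""
    p_abc = marginal(pmf, a + b + c)
    p_ac, p_bc, p_c = marginal(pmf, a + c), marginal(pmf, b + c), marginal(pmf, c)
    total = 0.0
    for key, p in p_abc.items():
        if p == 0.0:
            continue
        ka = key[:len(a)]
        kb = key[len(a):len(a) + len(b)]
        kc = key[len(a) + len(b):]
        total += p * log2(p * p_c[kc] / (p_ac[ka + kc] * p_bc[kb + kc]))
    return total

def directed_information(pmf, n):
    """I(X^n -> Y^n) for outcomes laid out as (x_1..x_n, y_1..y_n)."""
    return sum(
        cond_mutual_info(pmf, list(range(i)), [n + i - 1], list(range(n, n + i - 1)))
        for i in range(1, n + 1)
    )
```

For two uses of a noiseless binary channel with independent uniform inputs, the function returns 2 bits; if the outputs are independent of the inputs, it returns 0.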

For the achievability, we show that there exists a sequence of codes that achieves $C_{FB,n}$ for each 
$n$. As before, we assume that the alphabets are finite. In the light of Lemma 10, it suffices to show 
that 

\[ C'_{FB,n} = \max \frac{1}{n} I(U^n; Y^n) \qquad (24) \]

is achievable, where the auxiliary random variable $U_i$ has the cardinality bounded by $|\mathcal{U}_i| \le 
|\mathcal{X}|^{|\mathcal{X}|^{i-1} |\mathcal{Y}|^{i-1}}$ and the maximization is over all joint distributions of the form 

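\[ p(u^n, x^n, y^n) = \prod_{i=1}^{n} p(u_i) \, p(x_i \mid u_i, x^{i-1}, y^{i-1}) \, p(y_i \mid x^i, y^{i-1}) \]

with deterministic $p(x_i \mid u_i, x^{i-1}, y^{i-1})$, $i = 1, \ldots, n$. 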

Codebook generation and encoding. Fix $n$ and let $p_i^*(u_i)$, $i = 1, \ldots, n$, and $f_i^* : (u_i, x^{i-1}, y^{i-1}) \mapsto 
x_i$, $i = 1, \ldots, n$, achieve the maximum of (24). We will also use the notation $p^*(u^n) = \prod_{i=1}^{n} p_i^*(u_i)$ 
and $f^*(u^n, x^{n-1}, y^{n-1}) = (f_1^*(u_1), \ldots, f_n^*(u_n, x^{n-1}, y^{n-1}))$. 

For each $k = k(L, n) = Ln^2 + n$, $L = 1, 2, \ldots$, we generate a $(2^{kR}, k)$ code $\{X_i(w, Y^{i-1})\}_{i=1}^{k}$ as 
summarized in Figure 5. As before, $\tilde{\mathbf{X}}^{(n)}$, $\tilde{\mathbf{Y}}^{(n)}$, and $\tilde{\mathbf{Z}}^{(n)}$ are respectively related to the underlying 
sequences $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{Z}$ with every $(Ln+1)$st symbol omitted. 

Figure 5: Code, input, noise, and output sequences: n = 3, L = 2, m = 1. 

For each $w \in \{1, 2, \ldots, 2^{kR}\}$, we generate a codeword $\tilde{\mathbf{U}}^{(n)}(w) = \tilde{U}^{Ln^2}(w)$ of length $Ln$ on the 
$n$-letter alphabet $\mathcal{U}_1 \times \cdots \times \mathcal{U}_n$ independently according to 

\[ p(\tilde{u}^{Ln^2}) = \prod_{i=1}^{Ln} p^*(\tilde{u}_{(i-1)n+1}^{in}). \]

This gives a $2^{kR} \times Ln$ codebook matrix with each entry drawn i.i.d. according to $p^*(u^n)$. 

To communicate the message $W = w$, the transmitter chooses the codeword $\tilde{\mathbf{U}}^{(n)}(w) = \tilde{U}^{Ln^2}(w)$ 
and sends 

\[ \tilde{X}_{(i-1)n+j} = f_j^*\big( \tilde{U}_{(i-1)n+j}(w), \tilde{X}_{(i-1)n+1}^{(i-1)n+j-1}, \tilde{Y}_{(i-1)n+1}^{(i-1)n+j-1} \big), \quad i = 1, \ldots, Ln, \quad j = 1, \ldots, n. \]

Thus, the code function $X^n(w, Y^{n-1})$ utilizes the codeword $\tilde{\mathbf{U}}^{(n)}$ and the channel feedback $\tilde{\mathbf{Y}}^{(n)}$ 
only within the frame of $n$ transmissions (each box in Figure 5). 
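
The frame-by-frame use of feedback described above can be sketched in Python. This is a simplified illustration under stated assumptions: `send_frame`, the `strategy` signature, and the `channel_step` callable are hypothetical, and the toy channel in the usage note ignores both the noise and the channel memory of the actual model.

```python
def send_frame(u_block, strategy, channel_step):
    """Transmit one frame of n symbols using code functions with in-frame feedback.

    u_block      : tuple (u_1, ..., u_n) of auxiliary codeword symbols
    strategy     : strategy(j, u_j, x_past, y_past) -> x_j  (hypothetical signature)
    channel_step : channel_step(x_history) -> y_j            (hypothetical toy channel)
    """
    xs, ys = [], []
    for j, u in enumerate(u_block, start=1):
        # x_j depends only on u_j and on this frame's own inputs and outputs
        x = strategy(j, u, tuple(xs), tuple(ys))
        xs.append(x)
        ys.append(channel_step(tuple(xs)))
    return tuple(xs), tuple(ys)
```

For example, with the toy strategy `x_j = u_j XOR y_{j-1}` and a noiseless channel that echoes its current input, the auxiliary block (1, 0, 1) produces inputs (1, 1, 0). Because no state is carried across calls, the past of a previous frame never enters the next one, mirroring the boxes in Figure 5.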

Decoding. Upon receiving $Y^k$, the receiver declares that the message $\hat{W}$ was sent if there is a 
unique $\hat{W}$ such that 

\[ (\tilde{\mathbf{U}}^{(n)}(\hat{W}), \tilde{\mathbf{Y}}^{(n)}) \in A_\epsilon^{*(Ln)}(U^n, Y^n), \]

that is, $(\tilde{\mathbf{U}}^{(n)}(\hat{W}), \tilde{\mathbf{Y}}^{(n)})$ is jointly typical with respect to the joint distribution $p(u^n, y^n)$ specified 
by $p^*(u^n)p(z^n)$, $x_i = f_i^*(u_i, x^{i-1}, y^{i-1})$, and the definition of the channel (21). Otherwise, an error 
is declared. 

Analysis of the probability of error. We define the following events: 

\[ E_i = \{ (\tilde{\mathbf{U}}^{(n)}(i), \tilde{\mathbf{Y}}^{(n)}) \in A_\epsilon^{*(Ln)}(U^n, Y^n) \}, \quad i \in \{1, 2, \ldots, 2^{kR}\}. \]

As before, we assume $W = 1$ was sent. 

From Lemma 7, $\tilde{\mathbf{U}}^{(n)}(1)$ and $\tilde{\mathbf{Z}}^{(n)}$ are jointly typical with high probability for $L$ sufficiently large. 
Furthermore, $\tilde{\mathbf{Y}}^{(n)}$ is an $n$-letter blockwise function of $(\tilde{\mathbf{X}}^{(n)}(1), \tilde{\mathbf{Z}}^{(n)})$, and thus of $(\tilde{\mathbf{U}}^{(n)}(1), \tilde{\mathbf{Z}}^{(n)})$. 
Therefore, the probability of the event $E_1^c$ that the intended codeword $\tilde{\mathbf{U}}^{(n)}(1)$ is not jointly typical 
with $\tilde{\mathbf{Y}}^{(n)}$ vanishes as $L \to \infty$. 

On the other hand, $\tilde{\mathbf{U}}^{(n)}(i)$, $i \ne 1$, is generated blockwise i.i.d. $\sim p^*(u^n)$ independent of $\tilde{\mathbf{Y}}^{(n)}$. 
Hence, from Lemma 8, the probability of the event $E_i$ that $\tilde{\mathbf{U}}^{(n)}(i)$ is jointly typical with $\tilde{\mathbf{Y}}^{(n)}$ is 
bounded by 

\[ \Pr(E_i) \le 2^{-Ln(I(U^n; Y^n) - \delta)} \quad \text{for all } i \ne 1, \]

where $\delta \to 0$ as $\epsilon \to 0$. Consequently, we have 

\[ \Pr(\hat{W} \ne W) \le \Pr(E_1^c) + \sum_{i=2}^{2^{kR}} \Pr(E_i) \le 2\epsilon \]

if $L$ is sufficiently large and 

\[ kR = (Ln^2 + n)R < Ln\big( I(U^n; Y^n) - \delta \big). \]

Thus by letting $L \to \infty$ and then $\epsilon \to 0$, we can achieve any rate $R < C'_{FB,n}$. 
Finally by Lemma 10, this implies that we can achieve 

\[ C_{FB,n} = \max_{p(x^n \| y^{n-1})} \frac{1}{n} I(X^n \to Y^n), \]

which completes the proof of Theorem 1. 

5 Concluding Remarks 

Trading off generality for transparency, we have focused on the stationary channels of the form 

\[ Y_n = f(X_{n-m}^n, Z_{n-m}^n) \]

and presented a simple and constructive proof of the feedback coding theorem. The Shannon 
strategy (Lemma 10) has a fundamental role in transforming the feedback coding problem into a 
nonfeedback one, which is then solved by a scalable coding scheme of constructing a long typical 
input-output sequence pair by concatenating shorter nonergodic ones with appropriate phase shifts. 

This two-stage approach can be applied to other channel models and give a straightforward 
coding theorem. For example, we can show that the finite-state channel 

\[ p(y_n, s_n \mid s_{n-1}, x_n) = p(y_n \mid s_{n-1}, x_n) \, p(s_n \mid s_{n-1}, x_n, y_n) \]

with deterministic $p(s_n \mid s_{n-1}, x_n, y_n)$ (but no assumption of indecomposability) has the feedback 
capacity lower bounded by 

\[ C_{FB} \ge \sup_{n} \max_{p(x^n \| y^{n-1})} \min_{s_0} \frac{1}{n} I(X^n \to Y^n \mid s_0). \]

This result was previously shown by Permuter et al. [31, Section V] via a generalization of Gallager's 
random coding exponent method for finite state channels without feedback [14, Section 5.9]. Here 
we sketch a simple alternative proof. 

From a trivial modification of Lemma 10, the problem reduces to showing that 

\[ \max_{p(u^n),\, x_i = f(u_i, x^{i-1}, y^{i-1})} \min_{s_0} \frac{1}{n} I(U^n; Y^n \mid s_0) \qquad (25) \]

is achievable for each $n$. But the given Shannon strategy $(p^*(u^n), x^n = f^*(u^n, x^{n-1}, y^{n-1}))$ induces 
a new time-invariant finite-state channel on the $n$-letter super alphabet as $p(\tilde{y}_k, \tilde{s}_k \mid \tilde{s}_{k-1}, \tilde{u}_k)$. Hence 
we can use Gallager's random coding exponent method directly to achieve 

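\[ \lim_{k \to \infty} \max \min_{s_0} \frac{1}{kn} I(\tilde{U}^k; \tilde{Y}^k \mid s_0), \]

which can be shown to be larger than our target 

\[ \frac{1}{n} I(\tilde{U}_1; \tilde{Y}_1 \mid s_0), \]

because of the deterministic evolution of the state $S_n = f(S_{n-1}, X_n, Y_n)$. 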

We finally mention an important question that is not dealt with in this paper. Our characterization 
of the feedback capacity 

\[ C_{FB} = \lim_{n \to \infty} \max_{p(x^n \| y^{n-1})} \frac{1}{n} I(X^n \to Y^n) \qquad (26) \]

or any similar multi-letter expression is in general not computable and does not provide much 
insight on the structure of the capacity achieving coding scheme. One may ask whether a stationary 
or even Markov distribution is asymptotically optimal for the sequence of maximizations in (26). 
This problem has been solved for a few specific channel models such as certain classes of finite-state 
channels [6] 1^0] [31] [30] and stationary additive Gaussian noise channels [22], sometimes with 
analytic expressions for the feedback capacity. In this context, the current development is just the 
first step toward the complete characterization of the feedback capacity. 